** Let's meet a very nice guy: dplyr **
We will probably spend most of our time talking to our data.
Verbs are among the most important kinds of words in any language.
That makes sense: verbs carry out actions, and actions are exactly what we will ask of our data.
Sometimes we will need to take a variable and build a new one from it.
At other times we will want to list every combination of two categorical variables, count how often each combination occurs (its distribution), and even summarise the information that comes out of it.
At other times we will just need to sort the dataset by one or more variables, so that the largest (or smallest) values appear at the top (or at the bottom).
And finally, we will want to tell our data to show only the cases (rows) that fulfil some condition.
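The idea of counting combinations of two categorical variables can be sketched on a tiny made-up table. Everything here is an assumption for illustration: the `toy` tibble, its values, and the names `freq` and `by_combo` are hypothetical and are not part of the flights data.

```r
library(dplyr)

# Hypothetical toy data: two categorical variables and one numeric one.
toy <- tibble(
  origin  = c("JFK", "JFK", "LGA", "LGA", "LGA"),
  carrier = c("AA", "AA", "AA", "UA", "UA"),
  delay   = c(5, -2, 10, 0, 30)
)

# Frequency of each origin/carrier combination that occurs in the data.
freq <- toy %>% count(origin, carrier)

# The same grouping, with an extra summary per combination.
by_combo <- toy %>%
  group_by(origin, carrier) %>%
  summarise(n = n(), mean_delay = mean(delay), .groups = "drop")

freq
by_combo
```

Note that `count()` only lists combinations that actually appear; a pair such as JFK/UA, which has no rows in this toy table, is simply absent from the result.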
Quoting from dplyr's GitHub page:
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
select() picks variables based on their names.
mutate() adds new variables that are functions of existing variables.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
filter() picks cases based on their values.
#source("script_mns.R")
# library("nycflights13")
# saveRDS(flights,"data/flights.rds")
# saveRDS(airlines,"data/airlines.rds")
# library(tidyverse)
# mpg
flights<-readRDS("data/flights.rds")
# airlines<-readRDS("data/airlines.rds")
flights
select
library(dplyr)
# Keep only a handful of columns so the output is easier to read.
flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
mutate
flights %>% mutate(I_tarde=if_else(arr_delay>0,1,0))
# what a nice detail
flights%>%
select(sched_arr_time,arr_time,arr_delay)%>%
mutate(I_tarde=if_else(arr_delay>0,1,0))
arrange
flights%>%arrange(year)
flights%>%
arrange(year,month,day)
flights%>%
select(year,month,sched_dep_time,sched_arr_time,arr_time,arr_delay)%>%
mutate(I_tarde=if_else(arr_delay>0,1,0))%>%
arrange(desc(month),sched_dep_time,desc(I_tarde))
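To see how several sort keys interact, here is a minimal sketch on a four-row table. The `toy` tibble and its values are made up for illustration; only the column names mirror the flights example.

```r
library(dplyr)

# Hypothetical toy data with ties on the first two sort keys.
toy <- tibble(
  month          = c(1, 2, 2, 1),
  sched_dep_time = c(900, 600, 600, 530),
  I_tarde        = c(0, 0, 1, 1)
)

# Later months first; within a month, earliest departure first;
# remaining ties are broken by I_tarde, largest first.
ordered <- toy %>%
  arrange(desc(month), sched_dep_time, desc(I_tarde))

ordered
```

Each extra argument to `arrange()` only matters for rows that are tied on all the previous ones, which is why `desc(I_tarde)` decides the order of the two month-2 rows and nothing else.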
filter
flights%>%
filter(sched_dep_time<1200)
flights%>%
select(year,month,sched_dep_time,sched_arr_time,arr_time,arr_delay)%>%
mutate(I_tarde=if_else(arr_delay>0,1,0))%>%
filter(sched_dep_time>=0001 & sched_dep_time<600 & I_tarde==1)
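A small sketch of how combined conditions behave, on made-up data (the `toy` tibble and the names `with_and` and `with_commas` are hypothetical). It also shows that `filter()` drops rows where the condition evaluates to `NA`.

```r
library(dplyr)

# Hypothetical toy data: the last flight has no recorded arrival delay.
toy <- tibble(
  sched_dep_time = c(515, 545, 700, 530),
  arr_delay      = c(20, -3, 15, NA)
) %>%
  mutate(I_tarde = if_else(arr_delay > 0, 1, 0))

# Conditions joined with & ...
with_and <- toy %>% filter(sched_dep_time < 600 & I_tarde == 1)

# ... or passed as separate arguments, which filter() combines with &.
with_commas <- toy %>% filter(sched_dep_time < 600, I_tarde == 1)

with_and
with_commas
```

Both calls keep only the 515 flight. The 530 flight is scheduled before 600, but its `I_tarde` is `NA`, so the whole condition is `NA` and the row is left out.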
flights2<-flights%>%
mutate(I_tarde=if_else(arr_delay>0,1,0))
# is.na(flights2$I_tarde)
# Count the missing values in the new indicator and express them as a share of all rows.
n_na <- sum(is.na(flights2$I_tarde))
n_na
print(paste0("The 'NA's represent the: ", round(n_na/nrow(flights2),4)*100,"% of the Universe"))
[1] 9430
[1] "The 'NA's represent the: 2.8% of the Universe"
flights2%>%
filter(is.na(I_tarde))
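Where do these `NA`s come from? A likely explanation, sketched here on made-up data, is that they are inherited from missing values in `arr_delay`: comparing `NA` with a number gives `NA`, and `if_else()` returns `NA` when its condition is `NA`. The `toy` and `toy2` tibbles are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: the third flight has no recorded arrival delay.
toy <- tibble(arr_delay = c(12, -5, NA))

toy2 <- toy %>%
  mutate(I_tarde = if_else(arr_delay > 0, 1, 0))

# The missing delay propagates into the indicator.
toy2
```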
Option 1: Take them out of here
# flights2[!is.na(flights2$I_tarde),]
flights2_opt1<-flights2%>%
filter(!is.na(I_tarde))
flights2_opt1
Option 2: Special Value
flights2_opt2<-flights2%>%
mutate(I_tarde=if_else(is.na(I_tarde),-9999,I_tarde))
flights2_opt2
# No NAs should remain in the indicator...
sum(is.na(flights2_opt2$I_tarde))
print(paste0("The 'NA's represent the: ", round(sum(is.na(flights2_opt2$I_tarde))/nrow(flights2_opt2),4)*100,"% of the Universe"))
# ...because those rows now carry the special value instead.
sum(flights2_opt2$I_tarde==-9999)
print(paste0("The Special Values represent the: ", round(sum(flights2_opt2$I_tarde==-9999)/nrow(flights2_opt2),4)*100,"% of the Universe"))
[1] 0
[1] "The 'NA's represent the: 0% of the Universe"
[1] 9430
[1] "The Special Values represent the: 2.8% of the Universe"
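As a side note on Option 2, here is a minimal sketch of the same replacement written with `coalesce()`, together with the trade-off a special value brings. The `toy` and `filled` tibbles are made up, and swapping in `coalesce()` is an alternative suggested here, not the approach used above.

```r
library(dplyr)

# Hypothetical toy indicator with one missing value.
toy <- tibble(I_tarde = c(1, 0, NA, 1))

# coalesce() keeps the first non-missing value, so NA becomes -9999.
filled <- toy %>% mutate(I_tarde = coalesce(I_tarde, -9999))

# With NA we have to skip the missing rows explicitly.
mean(toy$I_tarde, na.rm = TRUE)

# With the special value nothing is missing, so it gets averaged in.
mean(filled$I_tarde)
```

The second mean is heavily distorted by the -9999 entry, so a special value is only safe when every later calculation knows to exclude it.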
Option N & beyond: More sophisticated stuff
print(important_mens)
[1] " We won't see any of these :D "
group_by
airlines<-readRDS("data/airlines.rds")
airlines